The graph in the centre depicts the difference between the human and Spotify ratings for energy and valence, where the human ratings were calculated as the mean of all viable survey responses in the respective categories. More precisely the human ratings were subtracted from the Spotify ratings, so a positive difference in energy and a negative difference in valence, as with Chinese contemporary music for example, means that the survey respondents rated these songs less energetic but more emotionally positive. Both human and Spotify ratings lie on a scale between 0 and 1, so the differences theoretically live within \([-1, 1]\), although in practice no absolute values larger than \(0.5\) were observed. The two-dimensional differences are shown in the scatter plot for each of the eight song categories. Furthermore an overall difference was also defined as the distance between the markers and the plot origin, i.e. \[ \texttt{Overall Difference} \;\;= \;\; \sqrt{\texttt{Valence Difference}^2 + \texttt{Energy Difference}^2} \] This value quantifies the overall inaccuracy of the Spotify energy and valence ratings. The bar chart which can be selected through the button in the top left plots these values for each of the eight categories.
The interactive maps on the right depict the overall difference between human and Spotify ratings depending on the respondents’ birth country. There is one plot for each of the four music cultures studied, which can be toggled using the radio buttons on the top right. The higher the mean rating difference of all respondents from a specific country is the more saturated the shading of that country is in the plot. Note that the accuracy of these values highly depends on the number of responses received from that country. As the scope of this project did not allow for very comprehensive data collection on a global scale, there is only a few responses from each country, which does not allow for any significant conclusions. The exact number of respondents can be read when hovering over a country in the plot.
The raw distances between the participants’ and Spotify’s ratings of energy and valence and the sum of those distances, representing total accuracy were scaled by converting them to z-scores. Z-scores describe each observation as its distance from the variable’s mean in standard deviations. The following analyses were conducted on scaled variables.
The main hypothesis that Spotify’s ratings would be less accurate for non-Western than Western songs was not supported. In actuality, an opposite effect was found: the ratings were more accurate for non-Western music (t(1543.1) = -2.64, p = .008), as shown with a Welsch independent-samples t-test.
A closer investigation of at the differences in accuracy with a two-way ANOVA testing for differences across regions, traditionality and their interaction showed a main effect of region (F(3) = 6.79, p < .001) and a main effect of traditionality (F(1) = 12.55, p < .001), but no interaction effects between the two (F(3) = 1.20, p = .310).
A follow-up series of pairwise t-test shed light on the differences in accuracy across regions. Significant differences were found for Western vs Arab, Western vs Indian, Chinese vs Arab and Chinese vs Indian songs. The alpha-level of significance was adjusted with the Bonferroni correction to 0.83%.
A follow-up Welsch t-test showed that the accuracy was higher for contemporary than traditional songs (t(2168.6) = -3.23, p = .001).
To gain a more detailed understanding of the effects, the study assessed the accuracy in valence and energy ratings separately.
A Welsch independent-samples t-test showed that valence ratings were equally accurate for non-Western and Western songs (t(1660.2) = -1.88, p = .060).
A two-way ANOVA testing for differences in valence ratings across regions, traditionality and their interaction found a main effect of region (F(3) = 4.07, p = .007) but no main effect of traditionality (F(1) = 0.90, p = .343). A significant interaction between region and traditionality was observed (F(3) = 2.72, p = .043).
Follow-up pairwise t-test were employed to test the differences in valence ratings’ accuracy across regions. At the Bonferroni-adjusted significance level of 0.83% no significant differences were found across regions.
Energy ratings were more accurate for non-Western than Western songs (t(1270.2) = -5.12, p < .001), as shown with a Welsch independent-samples t-test.
A two-way ANOVA testing for differences in energy ratings across regions, traditionality and their interaction found a main effect of region (F(3) = 25.06, p = < .001) and main effect of traditionality (F(1) = 342.14, p < .001). Furthermore, a significant interaction between region and traditionality was present (F(3) = 28.23, p < .001).
Follow-up pairwise t-test were employed to test the differences in energy ratings’ accuracy across regions. At the Bonferroni-adjusted significance level of 0.83% significant differences were found for Western vs Arab, Western vs Indian, Chinese vs Arab and Chinese vs Indian songs.
A follow-up Welsch t-test showed that the accuracy of energy ratings was higher for contemporary than traditional songs (t(1747.6) = -16.41, p < .001).
We conceptualized and distributed our survey using Qualtrics. Participants were first asked to give their consent, after which they were requested to provide information on their country of birth. Next, a definition of valence and energy were supplied to give each participant a better idea of what to keep in mind when rating each of the following songs. Each participant listened to all song clips in a randomized order, ensuring there were no order effects that could have influenced our results. The survey was distributed by making use of our personal network of family, friends, and acquaintances. After collecting responses for about four weeks, we ended up with a total number of 130 respondents. The countries most represented in our final sample were Slovenia, Slovakia, India, the Netherlands, Romania, and Germany. The collected data was then exported for further analysis.
Our research investigated the accuracy of the Spotify API in ranking the energy and valence of Western and non-Western songs. Our findings suggest that Spotify’s ranking accuracy is not consistent among different regions, and especially among classical and modern music. Overall, genres outside the Western mainstream resulted in lower accuracy, possibly due to Spotify’s focus on Western music and user demographics. Supporting this outcome is the fact that the two least accurately ranked styles turned out to be Western classical and Chinese traditional music.
Spotify is a western company with predominantly western user-base, therefore, from a marketing perspective, training the API to be as precise as possible on current popular music makes sense. However, this means that we also need to be careful at interpreting the values Spotify API puts out, as its main goal is not to create a database of songs full of accurate data that is viable for research, but to predict their user’s musical taste in order to keep them satisfied with the product they are paying for. From an ethical standpoint, the Spotify API should strive for accuracy across all cultures and all genres to avoid perpetuating cultural biases and promote equal representation in music recommendation systems.
Further research is needed to investigate the specific factors that influence Spotify to rank certain genres so differently compared to humans. This research, however, would require access to Spotify’s internal data to understand the specificities of each factor, as well as the inner workings of the scales it ranks the songs on.
The bane of our study was to find out if there is a potential bias in regard to how the Spotify API rates Western and Non-Western songs based on their Energy and Valence. By creating a survey where the participants were asked to rate 23 songs from different cultures, we were able to compare these findings to the ones from Spotify. For our data, we were able to collect 102 entries from participants residing in 26 different countries. Therefore, our findings showing geographical response data are based only on a few people from any given country and in that case should not be considered as representing the whole country. This issue would be solved with a larger sample size.
Another limitation we faced was correctly scaling up our 7-point likert scale and Spotify’s rating of valence and energy on a scale ranging from 0 to 1. Despite the public access to the rating of each song, the method Spotify uses to calculate the actual values remains obscured. Therefore, the most challenging part was a measurement assumption. For this research, we assumed that the distances between values on the Spotify scale are weighted equally. This might not be the case, potentially causing our results to be inaccurate.
In an ideal setting, we would be able to play the full songs to the participants, and not only 15 second snippets we had to resort to with the aim of keeping the questionnaire under 15 minutes long. The limitation here is that Spotify ranks both valence and energy as an average calculated from a whole song, while our participants had only 15 pre-selected seconds to rate each song. While we tried our best to select a representative clip, this might still cause the accuracy of the measurements to deviate.
Table with the clips, song title, artist, culture, modernity, (spotify/human ratings? )
| ID | Name | Audio |
|---|---|---|
| 1 | John | |
| 2 | Jane | |
| 3 | Bob | |
| 4 | John | |
| 5 | Jane | |
| 6 | Bob | |
| 7 | John | |
| 8 | Jane | |
| 9 | Bob | |
| 10 | John | |
| 11 | Jane | |
| 12 | Bob | |
| 13 | John | |
| 14 | Jane | |
| 15 | Bob | |
| 16 | John | |
| 17 | Jane | |
| 18 | Bob | |
| 19 | John | |
| 20 | Jane | |
| 21 | Bob | |
| 22 | John | |
| 23 | Jane | |
| 24 | Bob |